Abstract. To handle inverse problems, two probabilistic approaches have been
proposed: the maximum entropy on the mean (MEM) and the Bayesian estimation
(BAYES). The main object of this presentation is to compare these two approaches,
which are in fact two different inference procedures for defining the solution of an
inverse problem as the optimizer of a compound criterion.
1. Introduction

Inverse problems arise in many areas of science and engineering. In fact, we can
rarely measure a quantity x directly and, in general, the unobserved quantity of
interest x is related to the measured quantity y via a model. In many areas this
model can be written in the general form g = A(x) + n or, in the discrete case:

y = A(x) + n,   (1)

where y stands for the data, x for the unknown variables and n for the errors
(modeling and noise). Since Newton and Gauss, one tries to define a solution to
this problem as the optimizer of a criterion, for example the Least Squares (LS):

\hat{x} = \arg\min_x \{ \|y - A(x)\|^2 \}.   (2)

But inverse problems are, in general, ill-posed, and the LS criterion may not
have a unique optimum, or its solution may be very sensitive to noise. Since
Tikhonov [1], regularization theory has become the main approach for obtaining a
satisfactory solution by defining it as the optimizer of a compound criterion:

\hat{x} = \arg\min_x \{ J(x) \}   with   J(x) = Q(x) + \lambda \Omega(x) = \|y - A(x)\|^2 + \lambda \|Dx\|^2   (3)

or in its more general form:

\hat{x} = \arg\min_x \{ J(x) \}   with   J(x) = Q(y - A(x)) + \lambda \Omega(x, m).   (4)

The question then arises of how to choose the functionals Q and \Omega, the
regularization parameter \lambda and the default solution m.
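
As a minimal numerical sketch of why a compound criterion helps (it is not part of the original presentation), consider a hypothetical 2x2 linear model A(x) = Ax with a nearly singular matrix. The matrix, the data, the perturbation and the value of lambda are illustrative assumptions, and D is taken as the identity.

```python
# Sketch: least squares versus Tikhonov regularization on a nearly singular
# 2x2 linear problem. All numerical values are illustrative assumptions.

def solve_2x2(M, b):
    """Solve M z = b for a 2x2 matrix M by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

def tikhonov(A, y, lam):
    """Minimize ||y - A x||^2 + lam * ||x||^2 via the normal equations."""
    # Normal equations: (A^t A + lam I) x = A^t y
    AtA = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    Aty = [sum(A[k][i] * y[k] for k in range(2)) for i in range(2)]
    M = [[AtA[i][j] + (lam if i == j else 0.0) for j in range(2)]
         for i in range(2)]
    return solve_2x2(M, Aty)

def distance(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

A = [[1.0, 1.0], [1.0, 1.0001]]          # nearly singular operator
x_true = [1.0, 1.0]
y_noisy = [2.0, 2.0001 + 0.001]          # exact data plus a small error

x_ls = solve_2x2(A, y_noisy)             # plain LS (here: exact inversion)
x_reg = tikhonov(A, y_noisy, lam=0.01)   # regularized solution

err_ls = distance(x_ls, x_true)
err_reg = distance(x_reg, x_true)
print(err_ls, err_reg)
```

In this example the small data error is strongly amplified by the plain inversion, while the regularized solution stays close to x_true. The price is a small bias that depends on the chosen lambda.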

The probabilistic approaches started to give partial answers to these questions.
In particular, in the Bayesian estimation approach with the maximum a posteriori
(MAP) estimate:

\hat{x} = \arg\max_x \{ p(x|y) \} = \arg\min_x \{ -\log p(x|y) \} = \arg\min_x \{ -\log p(y|x) - \log p(x) \},   (5)

the choice is: Q = -\log p(y|x) and \lambda \Omega(x) = -\log p(x). This approach
just pushed the questions a little further: they became how to translate our prior
knowledge into a probability law and how to determine its parameters. Even if,
nowadays, there are many tools for the estimation of the hyperparameters ||^, the
main question of how to translate some knowledge about x into a probability
distribution remains without a complete answer. The maximum entropy (ME) principle
gave partial answers [@, H || . See also @ for an extensive annotated bibliography.

At the same time, many authors used the ME principle to find unique solutions
to linear inverse problems by considering x as a distribution and the data y as
linear constraints on it. Then, assuming that the data constraints are satisfied
by a non-empty set of solutions, a unique solution is chosen by maximizing the
entropy:

-\sum_j x_j \log x_j   or   -\sum_j \left[ x_j \log \frac{x_j}{m_j} - (x_j - m_j) \right],   (6)

where m is the default solution. See for example [7], |^ and the cited references.
However, even if in these methods, thanks to convex analysis and Lagrangian
techniques, the constrained optimization of the entropy can be replaced by an
equivalent unconstrained optimization, the obtained solutions satisfy the
uniqueness condition of well-posedness but not the stability one In] .

Recently, some authors |l^ used the ME principle in a different way by
considering x not as a distribution but as the mean value of a random
vector X and the data as constraints on its distribution dP(x). Then, the
ME principle is used to define this distribution uniquely and finally the
solution \hat{x} is defined as the expected value of this ME distribution.

Following these authors, some others used, commented on and analyzed these ideas
extensively 0, |20[|l| , [||, ||, |24[ ||, |2§1 and |2^, |28[|9|, H .

However, in all these works, the data y were considered as exact constraints and
the errors on the data were either neglected or only partially taken into account.
(See however new developments in [30].)

More recently, some authors who were more concerned with real applications,
|3^ , |3^ , followed the same idea, but with the objective of using these ideas
to describe the solution as the optimizer of a combined convex criterion of the
kind introduced above, and with more emphasis on a constructive way to determine
these functionals.

The objective of this paper is to make a comparison of the Bayesian approach,
which we call hereafter BAYES, and the maximum entropy on the mean, which we
refer to as MEM. This comparison is done very pragmatically and is based on the
understanding of the author, who does not pretend to know all the details
of both approaches and will be happy to discuss all the following points
with the experts of each approach.



2. Maximum entropy on the mean approach 
2.1. Basics 

The main references to the basics of this approach are Jl3, pl3[ . The original
idea and first applications in crystallography are given in [14], [15]. More
details and extensions are given in js^, 33. The mathematical aspects of convex
analysis and duality theorems are given in [^5|, ^ ^ | .

The following summarizes the different steps of the approach:

— Consider a set C, assume that x \in C and define a reference measure \mu(x):

x \in C,   m = \int_C x \, d\mu(x),   (7)

where m is the mean value of x under this reference measure.

— Consider x as the mean value of a random vector X for which you assume a
probability distribution P:

\hat{x} = E_P\{X\} = \int x \, dP(x)   (8)

and the data y as exact equality constraints on it:

y = A\hat{x} = A E_P\{X\} = \int A x \, dP(x).   (9)



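— Determine the distribution P by:

maximize   -\int \log \frac{dP(x)}{d\mu(x)} \, dP(x)   s.t.   y = A\hat{x} = A E_P\{X\}.   (10)

The solution is calculated via the Lagrangian:

\int \left[ -\log \frac{dP(x)}{d\mu(x)} - \lambda^t (y - Ax) \right] dP(x)   (11)

and is given by:

dP(x, \lambda) = \exp\left[ \lambda^t [Ax] - \log Z(\lambda) \right] d\mu(x),   (12)

where

Z(\lambda) = \int_C \exp\left[ \lambda^t [Ax] \right] d\mu(x).   (13)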
— The Lagrange parameters are calculated by searching for the unique solution (if
it exists) of the following system of nonlinear equations:

\frac{\partial \log Z(\lambda)}{\partial \lambda_i} = y_i,   i = 1, \ldots, M.   (14)

— The solution to the inverse problem is then defined as the expected value of
this distribution:

\hat{x}(\lambda) = \int x \, dP(x, \lambda).   (15)

These steps are very formal. In fact, it is possible to determine \hat{x}(\lambda)
in a more direct manner. Using the following notations:

s = A^t \lambda,   G(s) = \log Z(s) = \log \int \exp[s^t x] \, d\mu(x),   (16)

and

H(x) = \max_s \{ s^t x - G(s) \},   D(\lambda) = \lambda^t y - G(A^t \lambda),   (17)

it is shown that:

\hat{\lambda} = \arg\max_\lambda \{ D(\lambda) \}   (Dual criterion)   (18)

\hat{x} = \arg\min_x \{ H(x) \}   s.t.   y = Ax   (Primal criterion),   (19)

\hat{x}(s) = \frac{dG(s)}{ds}   (Explicit relation),   (20)

where:

— The functions G and H depend on the reference measure \mu(x);

— D(\lambda) is the dual criterion, which is a function of the data and of the function G;

— H(x) = H(x, m) is the primal criterion, which is a distance measure between
x and m, which means:

- H(x, m) \geq 0, and H(x, m) = 0 iff x = m;

- H(x, m) is differentiable and convex on C;

- H(x, m) = \infty if x \notin C.

Now, to go into a little more detail, let us assume that the reference measure
is separable:

\mu(x) = \prod_{j=1}^{N} \mu_j(x_j);   (21)

then, we have:

dP(x, \lambda) = \prod_{j=1}^{N} dP_j(x_j, \lambda)   (22)

and

G(s) = \sum_j g_j(s_j),   H(x, m) = \sum_j h_j(x_j, m_j),   \hat{x}_j = g_j'(s_j).   (23)

Replacing s = A^t \lambda we obtain:

G(\lambda) = \sum_j g_j([A^t \lambda]_j),   H(x, m) = \sum_j h_j(x_j, m_j),   \hat{x}_j = g_j'([A^t \lambda]_j),   (24)



where h_j and g_j depend on the reference measure \mu_j:

— g_j is the log Laplace transform (Cramer transform) of \mu_j:

g(s) = \log \int \exp[sx] \, d\mu(x);

— h_j is the convex conjugate of g_j:

h(x) = \max_s \{ sx - g(s) \}.

Let us give some examples:

            \mu_j(x)                        g_j(s)                  h_j(x, m)

Gaussian:   \exp[-\frac{1}{2}(x - m)^2]     \frac{1}{2}(s - m)^2    \frac{1}{2}(x - m)^2

Poisson:    \frac{m^x}{x!} \exp[-m]         \exp[m - s]             x \log\frac{x}{m} + m - x

Gamma:      x^{(\alpha - 1)} \exp[-x/m]     \log(s - m)             -\log\frac{x}{m} + \frac{x}{m} - 1
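
The conjugate relation h(x) = max_s {sx - g(s)} can be checked numerically. The sketch below does so for the Poisson case, assuming the normalized Poisson measure with mean m, whose log Laplace transform is the standard g(s) = m(exp(s) - 1). This normalization is an assumption of the sketch and may differ from the conventions of the g_j column above by sign and constant terms. The function names and the grid bounds are hypothetical.

```python
import math

# Sketch: numerical check of the convex-conjugate relation
# h(x) = max_s { s x - g(s) } for a Poisson reference measure with mean m.
# Assumption: the measure is normalized, so that g(s) = m (exp(s) - 1).

def g(s, m):
    """Log-Laplace transform of the normalized Poisson measure with mean m."""
    return m * (math.exp(s) - 1.0)

def h_closed_form(x, m):
    """Conjugate obtained from the stationarity condition x = g'(s) = m exp(s)."""
    return x * math.log(x / m) - x + m

def h_numeric(x, m, s_min=-10.0, s_max=10.0, n=200001):
    """Brute-force maximization of s x - g(s) over a regular grid of s values."""
    step = (s_max - s_min) / (n - 1)
    return max((s_min + k * step) * x - g(s_min + k * step, m)
               for k in range(n))

m = 2.0
for x in (0.5, 2.0, 5.0):
    print(x, h_closed_form(x, m), h_numeric(x, m))
```

Under this assumption the two evaluations agree to within the grid resolution, and h vanishes at x = m, in agreement with the distance-like properties of H(x, m) listed earlier.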



When \mu(x) is not separable it is very difficult to carry out the calculation in
more detail, except in the Gaussian case \mu(x) = N(m, R_x), where we have:

H(x, m) = \frac{1}{2} (x - m)^t R_x^{-1} (x - m),   G(\lambda) = \frac{1}{2} \|\lambda\|^2,   D(\lambda) = \lambda^t y - \frac{1}{2} \|\lambda\|^2.   (25)

(See however a new presentation of the method in [30], which tries to extend the
method to take account of the correlations.)
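
The steps of this section can be sketched numerically in the simplest separable setting. The example below assumes a normalized Poisson reference measure for each component, with g_j(s) = m_j(exp(s) - 1), and a single linear constraint (M = 1), so that the Lagrange parameter is a scalar and the dual criterion can be maximized by a scalar Newton iteration. The function name, the data and the choice of Newton's method are assumptions of this sketch, not prescriptions of the method.

```python
import math

# Sketch of the MEM steps for a separable Poisson reference measure and a
# single linear constraint y = sum_j a_j x_j (M = 1), so that the Lagrange
# parameter lam is a scalar. Assumption: g_j(s) = m_j (exp(s) - 1).

def mem_poisson_single_constraint(a, m, y, n_iter=100, tol=1e-12):
    """Maximize D(lam) = lam*y - sum_j g_j(a_j*lam) by Newton's method,
    then return (lam, x) with x_j = g_j'(a_j*lam) = m_j exp(a_j*lam)."""
    lam = 0.0
    for _ in range(n_iter):
        x = [mj * math.exp(aj * lam) for aj, mj in zip(a, m)]
        grad = y - sum(aj * xj for aj, xj in zip(a, x))     # D'(lam)
        hess = -sum(aj * aj * xj for aj, xj in zip(a, x))   # D''(lam) < 0
        if abs(grad) < tol:
            break
        lam -= grad / hess                                  # Newton step
    x = [mj * math.exp(aj * lam) for aj, mj in zip(a, m)]
    return lam, x

a = [1.0, 2.0, 3.0]      # the single row of A (hypothetical values)
m = [1.0, 1.0, 1.0]      # default solution: mean of the reference measure
y = 10.0                 # the single datum (hypothetical value)

lam_hat, x_hat = mem_poisson_single_constraint(a, m, y)
print(lam_hat, x_hat, sum(aj * xj for aj, xj in zip(a, x_hat)))
```

The returned solution satisfies the data constraint and stays positive by construction, since each component is the default value m_j multiplied by an exponential factor. With M > 1 the same scheme would need a linear solve at each Newton step.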
2.2. Extensions 

How to account for the noise: Two approaches have been developed in 

EM- 

— Replacing the exact equality constraints y = Ax by the following inequalities:

|y_i - [Ax]_i| < \epsilon,   or   \|y - Ax\|^2 < \sigma^2   (26)

and using the duality relations they showed:

\hat{x} = \arg\min_x \{ H(x) \}   s.t.   |y_i - [Ax]_i| < \epsilon,   with   H(x) = \sum_j h_j(x_j),

\hat{\lambda} = \arg\max_\lambda \{ \tilde{D}(\lambda) \}   with   \tilde{D}(\lambda) = D(\lambda) + \alpha \|\lambda\|,   (27)

where \alpha depends on \epsilon or on \sigma^2.
— Replacing y = Ax by y = Ax + n and rewriting it as follows:

y = [A | I] \begin{bmatrix} x \\ n \end{bmatrix}   \longrightarrow   y = \tilde{A} \tilde{x}   (28)

and assuming \mu(\tilde{x}) = \mu_x(x) \mu_n(n) they showed:

\hat{x} = \arg\min_x \{ Q(y - Ax) + \alpha H(x) \}   (29)

with   H(x) = \sum_{j=1}^{N} h_j(x_j),   and   Q(z) = \sum_{i=1}^{M} q_i(z_i).   (30)

Here also h_j(x_j) and q_i(z_i) depend on the reference measures \mu_x(x) and
\mu_n(n). The determination of \alpha is not discussed.



3. Bayesian approach 
3.1. Basics 

The different steps of this approach are now well-known:

— From the observation model and the hypothesis (prior knowledge) on the noise,
derive the likelihood p(y|x; \beta);

— From the hypothesis (prior knowledge) on x, derive the prior law p(x|\theta);

— Apply the Bayes rule to obtain p(x|y; \beta, \theta) = p(y|x; \beta) p(x|\theta) / p(y; \beta, \theta);

— Define an estimation rule via a cost function c(\hat{x}, x) by:

\hat{x}(y; \beta, \theta) = \arg\min_z \left\{ \int c(x, z) \, p(x|y; \beta, \theta) \, dx \right\}.   (31)

Different cost functions give different estimators:

— Maximum a posteriori (MAP):

c(\hat{x}, x) = 1 - \delta(\hat{x} - x)   \longrightarrow   \hat{x} = \arg\max_x \{ p(x|y; \theta, \beta) \}.   (32)

— Posterior mean (PM):

c(\hat{x}, x) = [\hat{x} - x]^t Q [\hat{x} - x]   \longrightarrow   \hat{x} = E_{x|y}\{X\} = \int x \, p(x|y; \theta, \beta) \, dx.   (33)

— Maximum of the marginal a posteriori (MMAP):

c(\hat{x}, x) = \sum_j [1 - \delta(\hat{x}_j - x_j)]   \longrightarrow   \hat{x}_j = \arg\max_{x_j} \{ p(x_j|y; \theta) \},   (34)

where

p(x_j|y; \theta) = \int p(x|y; \theta) \, dx_1 \cdots dx_{j-1} \, dx_{j+1} \cdots dx_n.   (35)
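
The difference between the MAP and PM estimators can be seen on a scalar toy problem evaluated on a grid. The prior, the observation model and all numerical values below are illustrative assumptions chosen so that the posterior is skewed; they are not taken from the text.

```python
import math

# Sketch: MAP and posterior-mean (PM) estimates computed on a grid for a
# scalar unknown. Assumptions (illustrative only): the prior is proportional
# to x exp(-x) on x > 0, the model is y = x + n, and the noise is Gaussian
# with precision beta.

def posterior_unnormalized(x, y, beta):
    """Likelihood times prior, up to a normalization constant."""
    prior = x * math.exp(-x)
    likelihood = math.exp(-0.5 * beta * (y - x) ** 2)
    return likelihood * prior

y, beta = 2.0, 1.0
step = 0.001
grid = [k * step for k in range(1, 20001)]             # 0 < x <= 20
weights = [posterior_unnormalized(x, y, beta) for x in grid]

# MAP: the grid point where the posterior is largest.
x_map = grid[max(range(len(grid)), key=weights.__getitem__)]

# PM: the mean of the normalized posterior (Riemann sum; the step cancels).
x_pm = sum(x * w for x, w in zip(grid, weights)) / sum(weights)

print(x_map, x_pm)
```

For this skewed posterior the two estimators differ: the MAP estimate is the mode, while the PM estimate is pulled towards the heavier right tail. For a symmetric unimodal posterior, such as the Gaussian case, they coincide.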
To illustrate this, let us consider the case of linear inverse problems y = Ax + n
with the following hypotheses:

— n is zero-mean, white and Gaussian: n ~ N(0, (1/\beta) I), which leads to:

p(y|x; \beta) \propto \exp\left[ -\frac{1}{2} \beta \|y - Ax\|^2 \right];   (36)

— x is Gaussian: x ~ N(x_0, (1/\theta) P_0):

p(x|\theta) \propto \exp\left[ -\frac{1}{2} \theta [x - x_0]^t P_0^{-1} [x - x_0] \right].   (37)

Then, using the Bayes rule it is easy to show that

x|y ~ N(\hat{x}, (1/\beta) \hat{P})   with   \hat{x} = x_0 + \hat{P} A^t (y - A x_0),   \hat{P} = (A^t A + \lambda P_0^{-1})^{-1}.   (38)

The MAP solution is:

\hat{x} = \arg\max_x \{ p(x|y) \} = \arg\min_x \{ J(x) \},   with   J(x) = Q(x) + \lambda \phi(x),   (39)

where

Q(x) = \|y - Ax\|^2,   \phi(x) = [x - x_0]^t P_0^{-1} [x - x_0] = \|D [x - x_0]\|^2,   \lambda = \theta / \beta,   (40)

with D^t D = P_0^{-1}.
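
The Gaussian case can be checked numerically in one dimension. The sketch below assumes a scalar model y = ax + n with P_0 = 1 and compares the closed-form posterior mean and variance, written with \lambda = \theta / \beta, against moments computed by brute force on a grid. All numerical values are illustrative assumptions.

```python
import math

# Sketch: scalar check of the Gaussian posterior. Assumptions (illustrative):
# y = a x + n with n ~ N(0, 1/beta) and prior x ~ N(x0, 1/theta), i.e. P0 = 1.

a, y, x0 = 2.0, 3.0, 0.5
beta, theta = 4.0, 1.0
lam = theta / beta

# Closed form: posterior mean and variance.
p_hat = 1.0 / (a * a + lam)
x_hat = x0 + p_hat * a * (y - a * x0)
var_hat = p_hat / beta

# Brute-force check: moments of exp(-beta * J(x) / 2) on a fine grid, with
# J(x) = (y - a x)^2 + lam (x - x0)^2 the MAP criterion.
step = 0.001
grid = [-5.0 + k * step for k in range(10001)]
w = [math.exp(-0.5 * beta * ((y - a * x) ** 2 + lam * (x - x0) ** 2))
     for x in grid]
total = sum(w)
mean_num = sum(x * wi for x, wi in zip(grid, w)) / total
var_num = sum((x - mean_num) ** 2 * wi for x, wi in zip(grid, w)) / total

print(x_hat, mean_num, var_hat, var_num)
```

Since the posterior is Gaussian here, the posterior mean computed on the grid is also the minimizer of J(x), so the MAP and PM estimates coincide in this case.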



Now, relaxing the second hypothesis, i.e., choosing other prior laws for x, we
obtain other MAP criteria. Let us just note some interesting special cases:

— A Generalized Gaussian law for x:

p(x_j) \propto \exp\left[ -|x_j - m_j|^p \right].   (41)

The related MAP criterion becomes:

J(x) = Q(x) + \phi(x)   with   \phi(x) = \sum_j |x_j - m_j|^p.   (42)

— A Gamma law for x:

x_j ~ G(\alpha, m_j)   \longrightarrow   p(x_j) \propto (x_j / m_j)^{-\alpha} \exp[-x_j / m_j].   (43)

The related MAP criterion becomes:

J(x) = Q(x) + \phi(x)   with   \phi(x) = \sum_j \left[ \alpha \log \frac{x_j}{m_j} + \frac{x_j}{m_j} \right].   (44)

— A Beta law for x:

x_j ~ B(\alpha, \beta)   \longrightarrow   p(x_j) \propto x_j^{\alpha} (1 - x_j)^{\beta}.   (45)

The related MAP criterion becomes:

J(x) = Q(x) + \phi(x)   with   \phi(x) = -\alpha \sum_j \log x_j - \beta \sum_j \log(1 - x_j).   (46)

— A Poisson law for x:

p(x_j) \propto \frac{m_j^{x_j}}{x_j!} \exp[-m_j].   (47)

The related MAP criterion becomes (using Stirling's approximation
\log x_j! \approx x_j \log x_j - x_j):

J(x) = Q(x) + \phi(x)   with   \phi(x) = \sum_j \left[ x_j \log \frac{x_j}{m_j} - (x_j - m_j) \right].   (48)

— Markovian models for x:

J(x) = Q(x) + \phi(x)   with   \phi(x) = \sum_j \sum_{i \in N(j)} V(x_j, x_i),   (49)

where N(j) denotes the set of neighbours of site j.
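
As a small sketch of how such criteria are assembled, the code below writes the penalty \phi as minus the logarithm of the prior, up to an additive constant, for the Gamma and Beta laws, and evaluates J(x) = Q(x) + \phi(x) with Q(x) = \|y - Ax\|^2. The function names, the matrix, the data and the parameter values are hypothetical.

```python
import math

# Sketch: the MAP penalty phi(x) = -log p(x) (up to an additive constant) for
# two of the separable priors above, written for a single component x_j.
# The parameter values used below are illustrative assumptions.

def phi_gamma(x, alpha, m):
    """-log of (x/m)^(-alpha) exp(-x/m), for x > 0."""
    return alpha * math.log(x / m) + x / m

def phi_beta(x, alpha, beta):
    """-log of x^alpha (1 - x)^beta, for 0 < x < 1."""
    return -alpha * math.log(x) - beta * math.log(1.0 - x)

def map_criterion(x, y, A, phi):
    """J(x) = ||y - A x||^2 + sum_j phi(x_j) for a list-of-rows matrix A."""
    residual = [yi - sum(aij * xj for aij, xj in zip(row, x))
                for row, yi in zip(A, y)]
    return sum(r * r for r in residual) + sum(phi(xj) for xj in x)

A = [[1.0, 0.5], [0.5, 1.0]]
y = [0.6, 0.7]
x = [0.3, 0.5]
J_beta = map_criterion(x, y, A, lambda t: phi_beta(t, 2.0, 3.0))
print(J_beta)
```

Any of the other separable laws can be plugged in the same way by passing a different single-component penalty to map_criterion.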

3.2. Extensions 

The Bayesian approach can be applied exactly when all the direct (prior)
probability laws (p(y|x, \beta) and p(x|\theta)) are assigned. Even if an
appropriate law is, in general, chosen by hand, another difficulty is to determine
their parameters (\beta, \theta). This problem has been addressed by many authors
and the subject is an active area in statistics. See 0, ||, |39[|o| , || ||, H, ||
and also g, H .

All these methods can mainly be divided into three main families:

— Generalized MAP: In this approach one tries to estimate both the
hyperparameters and the unknown variables x directly from the data by defining:

(\hat{x}, \hat{\theta}, \hat{\beta}) = \arg\max_{(x, \theta, \beta)} \{ p(x, \theta, \beta|y) \}   (50)

where

p(x, \theta, \beta|y) \propto p(y|x, \beta) \, p(x|\theta) \, p(\theta) \, p(\beta)   (51)

and where p(\theta) and p(\beta) are appropriate prior laws. Many authors used
non-informative prior laws for them.

— Marginalization: In this approach one tries to estimate first the
hyperparameters by marginalizing over the unknown variables x:

p(\theta, \beta|y) \propto p(\beta) \, p(\theta) \int p(y|x, \beta) \, p(x|\theta) \, dx,   (52)

(\hat{\theta}, \hat{\beta}) = \arg\max_{(\theta, \beta)} \{ p(\theta, \beta|y) \},   (53)

and then uses them in the estimation of the unknown variables x.

— Nuisance parameters: In this approach the hyperparameters are considered
as nuisance parameters and are therefore marginalized:

p(x|y) \propto \int\!\!\int p(y, x, \theta, \beta) \, d\theta \, d\beta   (54)

and the unknown variables x are estimated by:

\hat{x} = \arg\max_x \{ p(x|y) \}.   (55)

For more discussion and different possible implementations of these
approaches see [48].
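
The marginalization approach can be sketched on a toy model where the integral over x is available in closed form. The model below (independent components, known noise precision, flat prior on \theta) and the data values are assumptions made only for illustration.

```python
import math

# Sketch of the marginalization approach on a toy model (all values are
# illustrative assumptions): y_i = x_i + n_i, with x_i ~ N(0, 1/theta),
# n_i ~ N(0, 1/beta), beta known, and a flat prior on theta.
# Integrating out x_i gives y_i ~ N(0, 1/theta + 1/beta).

def log_marginal(theta, y, beta):
    """log p(y | theta, beta) after marginalizing over the unknowns x."""
    v = 1.0 / theta + 1.0 / beta
    return sum(-0.5 * math.log(2.0 * math.pi * v) - 0.5 * yi * yi / v
               for yi in y)

y = [1.2, -0.7, 0.3, -1.5, 0.9, 2.1, -0.4, -1.1]
beta = 4.0

# Step 1: estimate theta by maximizing the marginal law on a log-spaced grid.
thetas = [10.0 ** (-2.0 + 3.0 * k / 3000) for k in range(3001)]
theta_hat = max(thetas, key=lambda t: log_marginal(t, y, beta))

# Step 2: plug theta_hat into the estimation of x (posterior mean here).
x_hat = [beta * yi / (beta + theta_hat) for yi in y]

print(theta_hat, x_hat)
```

In this toy model the maximizer can also be written explicitly, since the marginal variance 1/\theta + 1/\beta is matched to the empirical mean of y_i^2. The grid search is kept only to mirror the general two-step procedure.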
4. Discussed points in each approach

4.1. MEM:

— Choice of C and \mu(x):

- C must be a convex set, such as: R^N, R_+^N, [a, b]^N.

- Up to now, the whole analysis can be done only for separable measures.

- The only reference measures \mu_j which permit going through all the steps
are those for which we have an analytical expression for the logarithm of
their Laplace transform.

— Accounting for the noise:

In the first approach only the support and the energy of the noise are used. In
the second approach we have more choices via the reference measure \mu_n(n),
but the determination of \alpha remains ad hoc. In fact, in general, when the
reference measures \mu_n(n) and \mu_x(x) depend on parameters, this approach
lacks any tool to determine them.

— Effective calculation of the solution:

No problem; moreover, this is probably the main interest of the approach,
which defines the solution, by construction, as the optimizer of a convex
criterion.

— Characterization of the solution:

A sensitivity analysis has been proposed by ]3^ , but, in my opinion, this is
not enough to characterize a solution to an inverse problem.
It is not easy to use the notions of variance or covariance of the solution,
because this approach does not define a posterior distribution for the solution.

— Possibility of the extension of the approach:

I have not yet seen any extension of this approach to nonlinear inverse
problems, or to linear inverse problems in which the operator depends on unknown
parameters, such as blind deconvolution or antenna array processing.
The fact that we have to choose a convex set C on which the solution is
defined excludes the inverse problems in which we know a priori that the solution
is discrete-valued (binary or n-ary images, for example). This excludes the use
of this approach in image segmentation or communication inverse problems
(channel equalization, blind deconvolution, etc.).

4.2. Bayes: 

— Choice of p(x|\theta):

p(x|\theta) can be chosen separable or not. Evidently, a separable p(x|\theta)
(Entropic prior laws) simplifies the calculations. Accounting for correlations is
easily done via Markovian models. In both cases (Entropic or Markovian prior laws),
there are some tools for choosing them either by physical considerations, or
by scale invariance arguments |5^, |5^, [52] |3j .

— Choice of the cost function or, equivalently, of an estimator (MAP, PM, MMAP):

This choice is made more on the basis of the computational cost. MAP calculation
needs, in general, global optimization, but does not need any integration.
PM or MMAP needs multidimensional integration and so, in general, has a greater
cost. However, there are approximate calculation techniques based on Monte Carlo
methods and Gibbs sampling.

— Effective calculation of different solutions:

For the MAP estimate, when the posterior law is unimodal, we can use any
gradient descent based method, but if this is not the case, there are two
categories of methods: Simulated Annealing or Deterministic relaxation (GNC).
For more discussions on Bayesian calculations see [54] in this volume.
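
For the unimodal case, a plain gradient descent is enough. The sketch below minimizes a convex MAP criterion with a Gaussian likelihood and an entropic penalty of the Poisson type. The matrix, the data, the default solution m, the weight lam and the fixed step size are illustrative assumptions, and the step is only known to be small enough for this particular example.

```python
import math

# Sketch: MAP estimate by plain gradient descent on the convex criterion
# J(x) = ||y - A x||^2 + lam * sum_j [x_j log(x_j/m_j) - x_j + m_j]
# (Gaussian noise, entropic prior). A, y, m, lam and the step size are
# illustrative assumptions; the step is chosen small enough for this example.

def gradient(x, A, y, m, lam):
    """Gradient of J: -2 A^t (y - A x) + lam * log(x / m), component-wise."""
    r = [yi - sum(aij * xj for aij, xj in zip(row, x))
         for row, yi in zip(A, y)]                          # residual y - A x
    At_r = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]
    return [-2.0 * g + lam * math.log(xj / mj)
            for g, xj, mj in zip(At_r, x, m)]

def map_gradient_descent(A, y, m, lam, step=0.1, n_iter=2000):
    x = list(m)                        # start at the default solution
    for _ in range(n_iter):
        grad = gradient(x, A, y, m, lam)
        # Keep the iterate strictly positive so that log(x_j/m_j) is defined.
        x = [max(xj - step * gj, 1e-12) for xj, gj in zip(x, grad)]
    return x

A = [[1.0, 0.5], [0.5, 1.0]]
y = [2.0, 1.5]
m = [1.0, 1.0]
lam = 0.1

x_map = map_gradient_descent(A, y, m, lam)
print(x_map, gradient(x_map, A, y, m, lam))
```

When the posterior is multimodal, such a local scheme only reaches a local optimum, which is why the global methods mentioned above are needed in that case.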



5. Comparisons and discussions 

The following main items are discussed:

— In MEM, the unknowns x are considered as the mean values of a random
vector X for which a prior probability measure d\mu(x) is defined.

— In BAYES, the unknowns x are considered as a sample of a random vector
X for which a prior probability measure p(x) is defined.

— In MEM, a probability distribution p(x) is defined as the minimizer of the
Kullback distance K(p, \mu) subject to the data constraints, and the solution
is defined to be E_p(X). What is interesting here is that this solution can
equivalently be obtained as the minimizer of a convex criterion J(x) subject
to the data constraints, and what is more attractive is that, thanks to
convex analysis, this solution can also be obtained from the stationary point of
a dual criterion which can easily be calculated numerically.

— In BAYES, the posterior law p(x|y) is calculated using the Bayes' rule. In fact,
the data y are considered as a sample of a random vector Y for which we can
define a conditional probability law p(y|x) which, when used in conjunction
with the prior p(x) in the Bayes' rule, gives us the posterior law, from
which we can define an estimator. One of these estimators is the posterior
mean E_p(X), but others can also be defined. This posterior law is used
not only to define an estimate (a solution), but also to calculate any other
probabilistic information about that solution.

— In MEM, in its original version, the data are considered as exact linear
constraints. The uncertainty on the data is not considered and, consequently,
the uncertainty on the solution is not handled. However, some extensions have
recently been presented to take account of the errors on the data and to
calculate the sensitivity of the solution to these errors.

— In BAYES, the errors are naturally considered through p(y|x) and the
uncertainty of the solution through the posterior probability p(x|y). Naturally
then, we can compare the information content of the data and the prior model
using their entropies. We can also measure the relative information content
of the posterior to the prior model by K(p(x|y), p(x)).

— In MEM, even in its extended versions, it is not easy to handle the
hyperparameters. In BAYES, there are the necessary tools to handle them.

— In MEM, one cannot yet handle nonlinear problems. This is not the
case for BAYES.

As a final conclusion, we have to mention that, even if the two approaches are
different, they can, in some cases, result in the same definition of the solution as
the minimizer of the same criterion and, consequently, give exactly the same
numerical solutions to a given inverse problem. However, we can give different
interpretations to the obtained result depending on the approach used to reach it.
The main objective of this paper was to give a succinct presentation of the two
approaches for the resolution of inverse problems.

It is important to note that the two approaches give different views and
interpretations which can be used advantageously for any application. Also, even if
the Bayesian approach is now really mature, the MEM approach is more recent. So,
many of the conclusions I have made today may be altered in the future. In
particular, the new presentation of the method by Heinrich et al. in this volume
will probably give new possibilities to the MEM method and will push the limits
of the method to greater generality.